perf: speed up MemoryStream IPC stream reads #340
CurtHagenlocher merged 6 commits into apache:main
Use exposed MemoryStream buffers for stream reader metadata reads while preserving reader-owned record batch body buffers. Keep exact-length memory-reader continuation handling and add focused stream reader coverage for exact slices, fallback streams, allocator behavior, cancellation, and aliasing.
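The exposed-buffer idea can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `MemoryStreamMetadataSketch` and `ReadBlock` are hypothetical names, and the sketch assumes `ReadOnlyMemory<byte>` is available (on net462/net472 that means the System.Memory package). It relies only on `MemoryStream.TryGetBuffer`, which returns `false` when the stream was constructed without a publicly visible buffer.

```csharp
using System;
using System.IO;

public static class MemoryStreamMetadataSketch
{
    // Hypothetical helper: returns the next `length` bytes of the stream.
    // When the stream is a MemoryStream that exposes its buffer, the result
    // is a zero-copy view over that buffer; otherwise the bytes are copied.
    public static ReadOnlyMemory<byte> ReadBlock(Stream stream, int length)
    {
        if (stream is MemoryStream ms && ms.TryGetBuffer(out ArraySegment<byte> segment))
        {
            if (ms.Length - ms.Position < length)
            {
                throw new EndOfStreamException("Not enough bytes left in the stream.");
            }

            // segment.Offset is non-zero when the stream wraps an array slice;
            // Position is relative to that offset.
            int start = segment.Offset + (int)ms.Position;
            ms.Position += length;
            return new ReadOnlyMemory<byte>(segment.Array, start, length);
        }

        // Fallback path: the stream may return fewer bytes than requested,
        // so keep reading until the block is complete.
        byte[] copy = new byte[length];
        int total = 0;
        while (total < length)
        {
            int read = stream.Read(copy, total, length - total);
            if (read == 0)
            {
                throw new EndOfStreamException("Stream ended before the block was complete.");
            }
            total += read;
        }
        return copy;
    }
}
```

A view returned by the fast path aliases the caller's buffer, so it is only suitable for data that is consumed immediately, such as message metadata. That is presumably why the record batch bodies stay in reader-owned buffers here.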
Cover exposed, non-public, managed-allocator, and explicit default-allocator MemoryStream reader scenarios across row and column counts. BenchmarkDotNet ShortRun (ArrowReaderBenchmark): 100000 rows / 1 col improved 21629.3 us to 7707.6 us with managed memory; 100000 rows / 5 cols improved 91112.3 us to 40137.5 us.
…eam-reader-managed-memory
Avoid Stream.ReadExactly in compression reader tests because Windows CI builds net462 and net472, where that API is unavailable.
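`Stream.ReadExactly` was introduced in .NET 7, so test code that also targets net462 and net472 needs a manual loop instead. A minimal sketch of such a helper is shown below; `StreamTestHelpers.ReadFully` is a hypothetical name and not necessarily what the tests in this PR use.

```csharp
using System;
using System.IO;

public static class StreamTestHelpers
{
    // Hypothetical stand-in for Stream.ReadExactly on older target frameworks.
    // Stream.Read may return fewer bytes than requested, so loop until `count`
    // bytes have been read or the stream ends.
    public static void ReadFully(Stream stream, byte[] buffer, int offset, int count)
    {
        int total = 0;
        while (total < count)
        {
            int read = stream.Read(buffer, offset + total, count - total);
            if (read == 0)
            {
                throw new EndOfStreamException(
                    $"Expected {count} bytes but the stream ended after {total}.");
            }
            total += read;
        }
    }
}
```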
CurtHagenlocher left a comment:
It feels like there are now effectively two different implementations of ArrowMemoryReaderImplementation: one for MemoryStreams and one for every other kind of stream. Given that this is an internal class, would it make more sense to have a separate class ArrowMemoryStreamReaderImplementation : ArrowMemoryReaderImplementation that handles the MemoryStream-specific flavor?
That makes sense. The current change does make ArrowStreamReaderImplementation carry both the general stream path and the exposed MemoryStream path. I’ll move the MemoryStream-specific logic into a separate internal implementation while preserving the existing stream-reader ownership semantics for record batch bodies.
Move the exposed MemoryStream-specific stream reader path into a dedicated internal implementation while keeping stream-reader body ownership semantics.
CurtHagenlocher left a comment:
Thanks again! My only real concern now is that some tests appear to have been inadvertently deleted from ArrowStreamReaderTests. These should be put back.
Summary
This improves ArrowStreamReader when reading from MemoryStream instances that expose their underlying buffer. The reader now uses the exposed buffer for IPC message/schema metadata reads while preserving the existing reader-owned body-buffer boundary.
The change is intentionally scoped to MemoryStream-backed IPC stream reads:
- MemoryStream can use the fast path
- MemoryStream and partial-read streams continue through the fallback stream-read path
- ArrowMemoryReader exact-length continuation-token handling is corrected for complete in-memory buffers
Benchmark
BenchmarkDotNet ShortRun, ArrowReaderBenchmark:
- ArrowReaderWithMemoryStream_ManagedMemory, 100000 rows / 1 column
- ArrowReaderWithMemoryStream_ManagedMemory, 100000 rows / 5 columns
Validation
- dotnet test test/Apache.Arrow.Tests/Apache.Arrow.Tests.csproj -c Release --filter "FullyQualifiedName~Apache.Arrow.Tests.ArrowStreamReaderTests"
- dotnet test test/Apache.Arrow.Compression.Tests/Apache.Arrow.Compression.Tests.csproj -c Release --filter "FullyQualifiedName~Apache.Arrow.Compression.Tests.ArrowStreamReaderTests"
- dotnet build Apache.Arrow.sln -c Release